═══════════════════════════════════════
5. PATTERNS IN BYTE SEQUENCES
═══════════════════════════════════════
Topic 4 showed how byte distributions help us to
analyze the content of a file. One simple fact may be obscured by
the length of Topic 4... that a byte survey and analysis take very
little time. The survey itself might require one to four seconds.
Reviewing it might involve another thirty seconds.
In Topic 5 we consider sequences of bytes. We want to
identify patterns related to our objectives of:
» extracting searchable content;
» recognizing record separations;
» recognizing field separations; and
» recognizing formatting aids.
═══════════════════════════════════════
5.1 Heads and tails...
first impressions of a file
═══════════════════════════════════════
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage head file_name [ line_count ] [/a][/t] > text
Displays in printable format the first line_count lines
within a file; the default is 10 lines. This clone of
the Unix HEAD and TAIL utilities provides a quick check on
the likely contents of a file. If the "/a" option is used,
accented characters are treated as printable text. If
"/t" is specified, the display is of the TAIL of the
file, the LAST line_count lines.
input: Normally an ASCII text file.
output: The specified number of lines is either displayed on the
screen or sent to a file. Each non-printable character is
replaced by an ^ symbol. If any line length exceeds 120
characters, a warning is issued. If any line length
exceeds 1024 or the file includes null bytes, the program
advises that the target file is not ASCII text.
writeup: MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
The HEAD program can be used to get a first impression
of the beginning and of the end of most files, although it is best
for ASCII text. Try for example:
HEAD SVP_TXT
The result shows directly on the screen. Alternately it may be
redirected into a new file. With the command "HEAD SVP_TXT", ten
lines are shown. Next try
HEAD SVP_TXT 4
and then
HEAD SVP_TXT 20
You are shown 4 lines and the next time 20 lines of text. For line
counts greater than 23, be ready to use CTL-S to stop and restart
movement across the screen.
Adding the argument "/t" switches heads to tails.
HEAD SVP_TXT /T
causes the last 10 lines of SVP_TXT to be shown on the screen.
HEAD SVP_TXT 25 /T > TEMP
or
HEAD SVP_TXT /T 25 > TEMP
places the last 25 lines in a file called TEMP. Note the file name
must come first; the order of arguments after that does not matter.
Incidentally, there is no restriction on the number of
lines. I tried HEAD SVP_TXT /T 4000 and found it worked!
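The behaviour described above can be sketched in a few lines of
Python. This is only an illustration, not the HEAD.C source: the
names show and head are hypothetical, and treating PC extended ASCII
bytes hex 80 to A5 as the "accented" range is an assumption.

```python
def show(byte, accents=False):
    """Render one byte roughly the way the HEAD write-up describes."""
    if 0x20 <= byte <= 0x7E:              # ordinary printable ASCII
        return chr(byte)
    if accents and 0x80 <= byte <= 0xA5:  # assumed accent range (CP437)
        return bytes([byte]).decode("cp437")
    return "^"                            # anything else shows as a caret


def head(data, line_count=10, tail=False, accents=False):
    """Return the first (or last) line_count lines of data as strings."""
    if line_count <= 0:
        return []
    lines = data.rstrip(b"\x1a").split(b"\n")   # ignore a trailing CTL-Z
    if lines and lines[-1] == b"":
        lines.pop()                       # no empty line after the final LF
    chosen = lines[-line_count:] if tail else lines[:line_count]
    return ["".join(show(b, accents) for b in line.rstrip(b"\r"))
            for line in chosen]
```

With a byte string read from a file, head(data, 4) corresponds to
"HEAD file 4" and head(data, 25, tail=True) to "HEAD file 25 /T".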
≡≡≡≡->> QUESTION:
Input the DOS command "COPY HEAD.C HEAD2.C". Then
revise HEAD2.C so that no file is named, and standard
input is the source of data. Compile the result and
experiment with it. The arguments are simpler, and
their order doesn't matter. What are the dangers of
using HEAD2.C in a DOS environment?
<<-≡≡≡≡
≡≡≡≡->> QUESTION:
Make another copy of HEAD.C and call it TAIL.C. Edit
it so that the resulting program needs no "/t" argument
and always shows the end of a file. Experiment.
<<-≡≡≡≡
Occasionally you might have text containing legitimate
accented characters. To demonstrate the "/a" (accents) argument:
HEAD SVP_TXT 150 /A > TEMP
then
HEAD TEMP /T
then
HEAD TEMP /T /A
What's really happening here? You are taking the top 150 lines of
a file, storing it in a separate temporary file, then displaying
the last 10 lines of the temporary file (that is, lines 141 to 150
of the original file) on the screen. This is a way of showing an
intermediate part of a file... not as fast as CPB (copy bytes), but
convenient. When you try the last two commands above, do you
notice the difference between the two displays? HEAD TEMP /T
includes a word that looks like H^tel; when accents are requested
in HEAD TEMP /T /A, the same word comes out as Hôtel.
≡≡≡≡->> QUESTION:
The experiment fails if you build the temporary file
without the /A argument (HEAD SVP_TXT 150 > TEMP). Why
does it fail?
<<-≡≡≡≡
On a non-text file, HEAD may either show a lot of caret
("^") characters, or conclude that a HEAD display is meaningless.
That information is worth the few seconds used to input the command
and see the result.
═════════════════════════
5.2 Non-DOS files
═════════════════════════
Suppose you display the head of a file and find it
looks like this:
Fourscore and seven years ago,
our forefathers brought forth
upon
this continent
a new nation,
conceived in liberty,
and dedicated
to the proposition
that all men are created equal.
This sample is not 80 characters wide, but you get the idea. Each
new line starts where the last one left off, and lines wrap around
onto the next line when the right margin is reached. This effect
is common when UNIX files are brought into a DOS environment. DOS
needs a carriage return to match each linefeed character.
Here's a simple solution:
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage: dosify file_name[s]
Replaces a UNIX-style file with a copy in which each
line feed is preceded by one carriage return, and the
file ends with one CTL-Z byte. Use this program on a
file in which the MORE command produces a skewed listing
that fails to go back to the left margin for new lines.
input: Any printable ASCII file[s].
output: The same file, with the same name, with DOS conventional
line ends and end of file.
writeup: MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
You can dosify a clutch of files in one command:
DOSIFY GETTYSBU.RG LINCOLN DOUGLAS WHATEVER
and the display begins to make sense:
Fourscore and seven years ago,
our forefathers brought forth
upon this continent
a new nation,
conceived in liberty,
and dedicated to the proposition
that all men are created equal.
DOSIFY appears to change files in place. In reality it makes a
copy, and if successful, destroys the original and changes the name
of the copy to match the original.
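The byte-level change that DOSIFY makes can be sketched as follows.
This is a rough Python illustration under stated assumptions, not the
DOSIFY source: the function name dosify is reused only for
convenience, it works on bytes in memory rather than on a file copy,
and collapsing repeated carriage returns is my own reading of "one
carriage return".

```python
import re


def dosify(data):
    """Return data with CR LF line ends and exactly one trailing CTL-Z."""
    body = data.rstrip(b"\x1a")               # drop any end-of-file marks
    body = re.sub(rb"\r*\n", b"\r\n", body)   # one CR before every LF
    return body + b"\x1a"                     # DOS end-of-file byte
```

Running the function twice gives the same result as running it once,
which is the property you want from a conversion that may be applied
to files that are already in DOS form.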
≡≡≡≡->> QUESTION:
Using A_BYTES on a non-DOS file, how would you
calculate in advance the number of bytes that it will
contain after it is dosified?
<<-≡≡≡≡
═════════════════════════════════════
5.3 Displaying printable data
═════════════════════════════════════
Our immediate objective is to get first impressions of
the content of a file. F_PRINT is a filter to show only printable
characters within a file. Unlike HEAD, it can start instantly at
any point within a file. An accent argument extends the range to
include accented (high-bit-set) characters, but not graphics.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage f_print file_name [/a][/w] [ from_byte to_byte ] > subset
Reduces a file to printable characters only. If the /w
option is specified, strings of printable characters that
are unlikely to be words are filtered out as well, and each
new burst of accepted text is placed on a new line. /a
causes accented characters to be accepted as printable.
input: Any file whatsoever, or any part of a file.
output: Printable subset.
writeup: MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
The command
F_PRINT SVP_TXT /A 121500 121700
causes the following display. Note the accented byte in the two
repetitions of Luçon.
CENT <P10>D<P8>EPAUL<R>i.s.C.M.
@TEXT60 = <MI>Addressed:<D> Monsieur Le Soudier, Priest of the
Mission of Luçon, in Luçon
@HEAD4 = 464. - TO N.
@TEXT4 = Saint-Lazare, Sunday, July 29, 1640
@TEXT7
For some fun, try F_PRINT on an executable EXE file,
first without the /W argument, then with /W. For example,
F_PRINT F_PRINT.EXE
and
F_PRINT F_PRINT.EXE /W
The second listing is much shorter and far more intelligible.
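The basic filter (without the /W word test, whose rules are not
spelled out here) might look like this in Python. It is a sketch, not
the F_PRINT.C logic: the name f_print is borrowed for convenience,
keeping carriage returns and line feeds is an assumption, to_byte is
assumed to be inclusive, and the accent range hex 80 to A5 is also an
assumption.

```python
def f_print(data, accents=False, from_byte=0, to_byte=None):
    """Keep only printable characters from data[from_byte..to_byte]."""
    end = len(data) if to_byte is None else to_byte + 1
    out = []
    for byte in data[from_byte:end]:
        if 0x20 <= byte <= 0x7E or byte in (0x0D, 0x0A):
            out.append(chr(byte))                      # plain text, CR, LF
        elif accents and 0x80 <= byte <= 0xA5:
            out.append(bytes([byte]).decode("cp437"))  # accented letter
        # every other byte is silently dropped
    return "".join(out)
```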
≡≡≡≡->> QUESTION:
In what ways might you amend source code in F_PRINT.C
to get other useful effects? Hint: Try variations in
the function check_store.
<<-≡≡≡≡
═══════════════════════════════
5.4 Detailed data dumps
═══════════════════════════════
Let's move beyond first impressions to methods of
displaying exactly what byte sequences occur in a file or part of
a file.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage dump file_name [/a] [ from_byte [ to_byte ] ] > report
Lists the contents of a specified portion of any file,
reporting 16 bytes per line. "/a" causes accented high bit
characters to be printed.
input: Any file whatsoever.
output: Printable ASCII report, listing offset, then 16 bytes in
hexadecimal format, with printable ASCII on the right;
periods substitute for non-printable bytes.
writeup: MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
DUMP permits commands such as:
DUMP SVP_TXT 0 800 > TOP_END
This preliminary test of the file can be tried on any portion of
the file. Moreover, if your target has 367 or fewer bytes, you can
send the output directly to the screen without worrying about CTL-S
stop and go control, as in:
DUMP SVP_TXT 100000 100366
DUMP restricts the printable character set to the 95
byte patterns ranging from hex 20 (space) up to hex 7E. This
restriction makes it much easier to recognize ordinary text; it is
not surrounded by a jumble of happy faces and graphic characters.
(Try the DOS 5.0 MORE command on any EXE executable file and see
what you get!) Other characters are in the strict sense
printable... carriage returns, line feeds and tabs. For accented
characters using PC compatible extended ASCII, add the accents
argument "/a":
DUMP SVP_TXT /A 11100 11200
Note the accented French word "écus" in the result.
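To make the report layout concrete, here is a minimal Python sketch
of a 16-bytes-per-line dump. It is not the DUMP source; the name
dump, the exact column spacing, and the accent range hex 80 to A5 are
assumptions chosen to resemble the listings in this topic.

```python
def dump(data, from_byte=0, to_byte=None, accents=False):
    """Return report lines: offset, up to 16 hex bytes, then ASCII."""
    last = len(data) - 1 if to_byte is None else min(to_byte, len(data) - 1)

    def visible(byte):
        if 0x20 <= byte <= 0x7E:               # the 95 printable patterns
            return chr(byte)
        if accents and 0x80 <= byte <= 0xA5:   # assumed accent range
            return bytes([byte]).decode("cp437")
        return "."                             # period for everything else

    lines = []
    for start in range(from_byte, last + 1, 16):
        row = data[start:min(start + 16, last + 1)]
        hex_part = " ".join(f"{b:02x}" for b in row)
        text = "".join(visible(b) for b in row)
        lines.append(f"{start}: {hex_part:<47}  {text}")
    return lines
```

A range of 367 bytes gives 23 report lines, which is why that figure
fits on one screen.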
═══════════════════════════════════════════
5.5 Convenient display of fragments
═══════════════════════════════════════════
Suppose we want to check out other high-bit-set bytes
found in the file SVP_TXT. Here is the list created by A_BYTES:
é [82] 43 0.0% 11144 12314 13915 14831 18658 23503
23800 26370
â [83] 1 0.0% 207322
à [85] 1 0.0% 116508
ç [87] 7 0.0% 95180 109218 121610 121620 129909
175862 181966
è [8A] 4 0.0% 130386 130571 161305 232659
î [8C] 4 0.0% 65079 93876 95582 138200
ô [93] 10 0.0% 8834 16736 28121 28656 97731 134953
163316 170678
One way to display a byte at a known location with its
context is to issue a DUMP command that straddles its location.
For example, to view the ç (c with cedilla) at offset 95180:
DUMP SVP_TXT /A 95100 95300
would do the job. But DUMP gives too much detail for this purpose.
The key lines in the screen display are:
95164: 79 65 3c 5e 3e 39 3c 44 3e 20 69 6e 0d 0a 4c 75  ye<^>9<D> in..Lu
95180: 87 6f 6e 3f 3c 5e 3e 31 30 3c 44 3e 20 49 20 66  çon?<^>10<D> I f
95196: 69 6e 64 20 69 74 20 64 69 66 66 69 63 75 6c 74  ind it difficult
A more convenient program is FRAGMENT:
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage fragment input_file offset > stdout
Display a five line fragment of a file in printable form
with two lines of context on either side of the selected
offset. Useful to get a quick view of contents at a
selected location in a file. Use CPB and/or DUMP for an
alternate method, less convenient, but with more detail.
input: Most useful for printable ASCII files.
output: Five double spaced lines in which non-printing characters
are shown as blank with a ^ in the blank line below. The
character at the exact offset is marked by a | in the blank
line below.
writeup: MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
To display context of the first ç at offset 95180,
simply input:
FRAGMENT SVP_TXT 95180
Five lines are displayed; the third line starts this way:
Luçon?<^>10<D> I find it difficult to
  |
Notice how the ç is highlighted by the vertical bar | immediately
underneath.
Try showing the î at byte 138200 with this command:
FRAGMENT SVP_TXT 138200
You are shown the context around:
M. Benoît<^>2<D> does not return
       |
and again a vertical bar underneath draws attention to the î.
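The marking scheme is easy to sketch. The Python below handles only
the one line that contains the offset (FRAGMENT itself shows five
lines); the name fragment_line is hypothetical, and the choice to
treat bytes hex 80 to A5 as displayable accents is an assumption.

```python
def fragment_line(data, offset):
    """Return (text, marks) for the line that contains data[offset]."""
    start = data.rfind(b"\n", 0, offset) + 1   # first byte of the line
    end = data.find(b"\n", offset)
    if end == -1:
        end = len(data)
    line = data[start:end].rstrip(b"\r")
    text, marks = [], []
    for byte in line:
        if 0x20 <= byte <= 0x7E:
            text.append(chr(byte))
            marks.append(" ")
        elif 0x80 <= byte <= 0xA5:             # assumed accent range
            text.append(bytes([byte]).decode("cp437"))
            marks.append(" ")
        else:
            text.append(" ")                   # blank above, caret below
            marks.append("^")
    col = offset - start
    marks.extend(" " * (col + 1 - len(marks)))  # pad if offset is at line end
    marks[col] = "|"                           # the selected byte
    return "".join(text), "".join(marks).rstrip()
```

Printing the two returned strings one above the other reproduces the
effect of the vertical bar sitting under the selected character.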
FRAGMENT works for any ASCII file, particularly where
line lengths are under 80 characters. (Unix users: Many terminals
are unpredictable when they attempt to display bytes with the high
bit set. The source code contains notes indicating where to make
necessary changes.) In the SVP_TXT example, the locations shown
above all check out as valid accented characters within French
names. Further along, we will find that by using the program
A_PATTRN we can verify that all bytes with high bit set in our
sample SVP_TXT are valid.
══════════════════════════════════════════════
5.6 Viewing patterns throughout a file
══════════════════════════════════════════════
The techniques thus far display context at specific
points within a file... the beginning, the end, or near certain
offsets. More is needed. We want to be able to:
» ensure that patterns are consistent across all the
data;
» identify every set of codes and signals that may help
us toward our objectives of interpreting record and
field separators, searchable content, etc.
At the end of the preceding topic, we concluded that
our sample file, SVP_TXT, is extended ASCII text (normal text plus
accented letters), and that the usage of certain characters needs
to be checked out: @ = ^ | < >.
The program A_PATTRN can be used to isolate every
occurrence within a file of a single character or of a string of up
to 16 characters.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage: a_pattrn file_name key [ /x ] [ bytes_before ] > report
"/x" = include hex, show only 16 bytes instead of 40
List every occurrence of a key character or string in a file.
Show 3 (or "bytes_before", range 0 to 15) bytes prior to the
key each time. Normally show a total of 40 bytes each time
the key is found; if the "/x" argument is set, show only 16
bytes, but in hex and ASCII both. The key may be from 1 to
16 characters. Within the key, any non-printing characters,
characters which may confuse DOS (> or < or |), linefeeds,
blanks, backslash, etc. must be shown in hex form... a
backslash and 2 hex digits. Examples:
a_pattrn herfile \8E > herfile.8e
a_pattrn yourfile * 7 > yourfile.ast
a_pattrn myfile Mother
a_pattrn hisfile \94\05ke\ff 0 > 5char.pat
input: Any file whatsoever.
output: One line for each occurrence of the target byte(s) in the
file. Sort the result to make patterns show up more clearly.
writeup: MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
DOS assigns meaning to certain characters (such as
space, | < >, etc.), so if you have any problem using the A_PATTRN
command, switch to the hex format for the search key (the letter or
sequence of letters on which you wish to search). For example,
A_PATTRN SVP_TXT > > SVP3E
fails, but
A_PATTRN SVP_TXT \3E > SVP.3E
works.
The key benefit of A_PATTRN is that it selects the same
byte or string and places it in the same position on each line.
Patterns begin to emerge at once. Here are the first few lines
produced by:
A_PATTRN SVP_TXT \3C\5E > RESULT
00000641:    men<^>2<D> want to communicate..in writi
00001138:    ot]<^>3<D> concern the hospital, you..ca
00001467:    uf,<^>4<D> on Madame..Goussault's<^>5<D>
00001497:    t's<^>5<D> estates, I believe, although
00001678:    eu,<^>6<D> and to do so as soon as possi
00002197:    ne,<^>2<D> which I see from your..letter
00003152:    is;<^>3<D> who put together his council
00003412:    us?<^>4<D> Indeed, there is no good or..
00003747:    ey,<^>5<D> who on another..occasion spok
00005086:    eva<^>6<D> and,..following his example,
\3C\5E appears as <^ starting at column 17 in each line. It is a
simple matter to perform a sort:
SORT /+17 < RESULT > RESULT.SRT
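For readers who want to see the mechanics, here is a small Python
sketch of the same idea: list every occurrence of a key with a fixed
amount of leading context, then sort from the column where the key
starts. It is not the A_PATTRN source. The names a_pattrn and
sort_from_column, the four blanks after the colon (chosen so that the
key lands in column 17 when three bytes of context are shown), and
the periods used to pad missing context at the start of a file are
all assumptions drawn from the sample listings.

```python
def a_pattrn(data, key, bytes_before=3, width=40):
    """List every occurrence of key in data with fixed context."""
    lines = []
    pos = data.find(key)
    while pos != -1:
        lead = data[max(pos - bytes_before, 0):pos]
        lead = b"." * (bytes_before - len(lead)) + lead   # pad at file start
        window = lead + data[pos:pos + width - bytes_before]
        text = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in window)
        lines.append(f"{pos:08d}:    {text}")   # key offset, then context
        pos = data.find(key, pos + 1)
    return lines


def sort_from_column(lines, column):
    """Imitate SORT /+column: compare from that column, ignoring case."""
    return sorted(lines, key=lambda s: s[column - 1:].upper())
```

Because the key always begins in the same column, sorting from that
column groups identical patterns together regardless of where they
occur in the file.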
The hexadecimal output from A_PATTRN (when argument /x
is used) looks just like the output of DUMP. Here we have
shortened the lines a bit.
00000641: 6d 65 6e 3c 5e 3e 32 3c 44 3e 20... men<^>2<D> want
00001138: 6f 74 5d 3c 5e 3e 33 3c 44 3e 20... ot]<^>3<D> conce
00001467: 75 66 2c 3c 5e 3e 34 3c 44 3e 20... uf,<^>4<D> on Ma
00001497: 74 27 73 3c 5e 3e 35 3c 44 3e 20... t's<^>5<D> estat
00001678: 65 75 2c 3c 5e 3e 36 3c 44 3e 20... eu,<^>6<D> and t
00002197: 6e 65 2c 3c 5e 3e 32 3c 44 3e 20... ne,<^>2<D> which
00003152: 69 73 3b 3c 5e 3e 33 3c 44 3e 20... is;<^>3<D> who p
00003412: 75 73 3f 3c 5e 3e 34 3c 44 3e 20... us?<^>4<D> Indee
00003747: 65 79 2c 3c 5e 3e 35 3c 44 3e 20... ey,<^>5<D> who o
00005086: 65 76 61 3c 5e 3e 36 3c 44 3e 20... eva<^>6<D> and,.
The hex result can also be sorted (SORT /+20). When dealing with
fully printable files, the hex rendition of each byte is not
particularly useful. The one piece of information the hex version
provides is that the ".." pattern within the ASCII is usually a
carriage return - line feed combination (hex 0d 0a).
Whichever output is selected, we discover that the two
bytes "<^" in the file SVP_TXT are in every case followed by ">",
a one or two digit number, then "<D>". The lowest numbers, 1 and
2, are most frequent. The frequency falls off steadily so that the
highest, "<^>27<D>" occurs only once.
Looking at the patterns around the single character hex
5E ("^") alone reveals two other combinations: "<B^>#<D>" and
"<I^>#<D>". The three basic patterns <^>, <B^> and <I^> account
for all occurrences of the caret character "^".
≡≡≡≡->> QUESTION:
DOS 5.0 has a "FIND" command which can also be used to
list every line in which a character sequence appears.
Compare the respective advantages of FIND and A_PATTRN.
<<-≡≡≡≡
═════════════════════════════════════════
5.7 The power of sorting patterns
═════════════════════════════════════════
Suppose we look for patterns around the single "at
sign" (@) character:
A_PATTRN SVP_TXT @ > AT_SIGN.SVP
The result contains 935 lines which start out as follows:
00000000:    ...@HEAD1 = SAINT VINCENT DE PAUL..@HEAD
00000032:    L..@HEAD2 = CORRESPONDENCE..@HEAD4 = 417
00000057:    E..@HEAD4 = 417. - TO SAINT LOUISE DE MA
00000121:    S..@TEXT4 = Paris, January 11, 1640..@TE
00000155:    0..@TEXT7 = Mademoiselle,..@TEXT6 = I re
00000179:    ,..@TEXT6 = I received three letters fro
00000605:    ...@TEXT6 = Seeing that those Gentlemen<
00001613:    ...@TEXT6 = You would do well to send fo
00001799:    ...@TEXT6 = People are praying to God fo
00001941:    ...@HEAD4 = 418. - TO LOUIS ABELLY,<B^>1
We may have tripped upon the record separators and field separators
that we are looking for. Notice particularly the pattern @HEAD4 =
which is followed by a number. We have several options at this
point. One is to lengthen the key and re-run the pattern analysis:
A_PATTRN SVP_TXT @HEAD > AT_HEAD.SVP
and
A_PATTRN SVP_TXT @TEXT > AT_TEXT.SVP
Alternately, since the earlier listing AT_SIGN.SVP has
only 51,425 bytes, we can sort it beginning at column 17:
SORT /+17 < AT_SIGN.SVP > AT_SIGN.SRT
As we view the sorted result, patterns become very clear. Here is
part of the analysis that I reported after a few more tests with
the A_PATTRN program. At this point, analysis was still tentative,
but it provided a good basis for discussion with the database
provider.
Analysis of SVP_TXT
ASCII text with Printer's Codes (Headers)
February 12, 1992
The following are tentative interpretations of the
Printer's Codes embedded in SVP_TXT. Corrections to errors would
be welcome.
@HEAD1     database heading, 1 occurrence at beginning
@HEAD2     database subheading, 1 occurrence at beginning
@HEAD4     sequence number, 1 per letter
@TEXT31    dateline
@TEXT4     dateline, letter from s.v.p.
@TEXT41    dateline, letter to s.v.p. from other person
@TEXT5     signature line, s.v.p.
@TEXT51    signature line, other person
@TEXT6     paragraph start, letter from s.v.p.
@TEXT60
@TEXT600
@TEXT61    paragraph start, letter from other person
@TEXT611   address line to s.v.p.
@TEXT7     salutation from s.v.p.
@TEXT71    salutation to s.v.p.
<169>      beginning quote (7X)
<170>      end quote (7X)
<197>      dash (23X)
<B^>1<D>   superscript footnote ref in heading (11X)
<D>        terminator for other < > symbols (594X)
<I^>#<D>   superscript footnote ref in heading (59X)
<M>        emphasis -- bold, highlight, italics? (6X)
<MI>       emphasis -- bold, etc.? (134X)
<P10>, <P7>, <P7M>, <P8>, <P8MI>, <P9>
           pica measures for font size (212X)
<R>        ?? 18 of 21X <R>i.s.C.M. in signature
<^>#<D>    superscript footnote ref in text (409X)
<|>        blank position holder, not to end line (28X)
═══════════════════════════════
5.8 Sorting large files
═══════════════════════════════
As files get larger, the DOS SORT slows down. Sorting
the 935 lines (51,425 bytes) in AT_SIGN.SVP took 30 seconds on a 12
megahertz AT clone. As the SORT 64k byte limit is approached,
things fall apart.
A description follows for SORT2, a device to get around
the 64k limit. It's not elegant, but it works!
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage - sort2 [/r] [/+n] from_file to_file key[s]
Sorts large ASCII text files using the memory-bound DOS SORT
routine in multiple passes. /r signifies reverse order.
/+n specifies a starting column, 1-999. A key is 1 to 3
characters, used as a dividing point. The program separates
the input file into a series of temporary files, depending on
the byte(s) at the starting column. For n dividing points,
the program makes n+1 temporary files, and reports the size
of each. If all are under 60k characters, they are sorted
and placed together in the output file. If a run fails, add
another dividing point mid-way in the range that fails (that
is, the file that is too big), and try again. NOTE: The DOS
SORT starts column count at 1, converts all lower to upper
case!
input: Line oriented printable ASCII.
output: Same file, sorted.
writeup: MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
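The divide-and-sort idea behind SORT2 can be sketched in Python as
follows. This is an illustration only, not the SORT2 source: the
names partition and sort2 are used for convenience, the lists stand
in for the temporary files, and the 60k size check is left out.

```python
import bisect


def partition(lines, keys, start_col=1):
    """Split lines into len(keys) + 1 buckets by the text at start_col."""
    points = sorted(k.upper() for k in keys)
    buckets = [[] for _ in range(len(points) + 1)]
    for line in lines:
        field = line[start_col - 1:].upper()    # DOS SORT ignores case
        buckets[bisect.bisect_right(points, field)].append(line)
    return buckets


def sort2(lines, keys, start_col=1, reverse=False):
    """Sort each bucket on its own, then join the buckets in order."""
    buckets = partition(lines, keys, start_col)
    if reverse:
        buckets.reverse()                       # highest range first
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket,
                             key=lambda s: s[start_col - 1:].upper(),
                             reverse=reverse))
    return result
```

Because every line in one bucket sorts before every line in the next,
joining the separately sorted buckets gives the same order as one big
sort, while each individual sort stays small.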
≡≡≡≡->> QUESTION:
Try out SORT2, get a feel for how it works. Can you
come up with ways to make it easier to use or more
powerful? Or do you have your own super sort that you
are willing to publish under copyleft rules?
<<-≡≡≡≡
Another way to speed up sorts is to throw away portions
of the target file that are not essential for the purpose you have
in mind when sorting. For example, the program A_PATTRN produces
8-digit offsets followed by a colon and white space, and up to 40
bytes of information. Are they all necessary?
The program COLRM removes the same columns from every
line of ASCII text in a file:
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage - colrm from_col to_col < printable_ascii > revised_ascii
Removes the specified range of columns from each line of an
ASCII file. This is a clone of the Unix "colrm" utility.
input: A printable ASCII file with less than 512 characters per
line. Columns number from 1 upward.
output: The same number of lines, but with one segment of columns
removed from each line.
writeup: MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Since ASCII text is the only accepted input, we are
safe in a DOS environment to use standard input and output. There
is no confusion over line feeds and CTL-Z characters. An added
benefit is that we can pipe the output of successive runs of COLRM.
Recall the earlier example
A_PATTRN SVP_TXT \3C\5E > RESULT
which produced output that started off like this:
00000641:    men<^>2<D> want to communicate..in writi
00001138:    ot]<^>3<D> concern the hospital, you..ca
00001467:    uf,<^>4<D> on Madame..Goussault's<^>5<D>
00001497:    t's<^>5<D> estates, I believe, although
00001678:    eu,<^>6<D> and to do so as soon as possi
00002197:    ne,<^>2<D> which I see from your..letter
00003152:    is;<^>3<D> who put together his council
00003412:    us?<^>4<D> Indeed, there is no good or..
00003747:    ey,<^>5<D> who on another..occasion spok
00005086:    eva<^>6<D> and,..following his example,
Our primary interest is in the patterns of the form
<^>#<D> and <^>##<D>. We could remove the first 16 columns by:
COLRM 1 16 < RESULT > TEMP
and those that follow the 8 characters of interest by
COLRM 9 99 < TEMP > RESULT2
Notice that you can use any large number that will reach to the end
of all lines. Alternately, you can do the two steps in one:
COLRM 1 16 < RESULT | COLRM 9 99 > RESULT2
RESULT2 has only 4,090 bytes, in contrast to the 22,086 in RESULT.
The new file starts off like this:
<^>2<D>
<^>3<D>
<^>4<D>
<^>5<D>
<^>6<D>
<^>2<D>
<^>3<D>
<^>4<D>
<^>5<D>
<^>6<D>
An eighty per cent reduction in file size pays off when sorting.
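Column removal is simple enough to show in one line of Python. The
sketch below is not the COLRM source; the function name colrm is
reused for convenience and it works on one line at a time, with
columns numbered from 1 as in the usage box. The sample line assumes
four blanks after the colon, so that the key starts in column 17.

```python
def colrm(line, from_col, to_col):
    """Remove columns from_col..to_col (numbered from 1) from one line."""
    return line[:from_col - 1] + line[to_col:]


# The two-step pipe from the text, applied to one line of RESULT.
sample = "00000641:    men<^>2<D> want to communicate..in writi"
reduced = colrm(colrm(sample, 1, 16), 9, 99)
```

Here reduced holds the eight characters that remain, the tag followed
by one blank. A to_col beyond the end of the line does no harm, which
is why any large number works for the second step.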
A_OCCUR is useful in analyzing sorted files that
contain many repetitions.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage - a_occur [ min_freq ] [ /n ] < ascii_text > report
/n = non-sequenced data is okay
Count the frequency of occurrence of identical lines
If a minimum frequency is specified, lines occurring
fewer times are dropped entirely from the result.
Input: ASCII text, which must be in sorted order UNLESS the
flag "/n" is included.
Output: A reduced copy of the file with each line shown only
once. Each line begins with a frequency count, padded
out to six characters with blanks.
Writeup: MIR TUTORIAL ONE, topic five.
See also the related programs A_OCCUR2 and A_OCCUR3.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Here is the top end of the output when RESULT2 has been
sorted (A_OCCUR expects sorted input unless "/n" is given) and we
input the command
A_OCCUR < RESULT2 > FREQ
10 <^>10<D>
8 <^>11<D>
6 <^>12<D>
5 <^>13<D>
5 <^>14<D>
3 <^>15<D>
3 <^>16<D>
3 <^>17<D>
2 <^>18<D>
2 <^>19<D>
Frequency is the first element. For example, the pattern <^>11<D>
occurs 8 times, <^>12<D> occurs 6 times. It was the regularly
declining frequency of the numbers that first suggested to me that
these tags indicate footnote numbers within the test file SVP_TXT.
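Counting identical lines in sorted input takes very little code. The
Python below is a sketch, not the A_OCCUR source: the name a_occur is
reused for convenience, the count is assumed to be left-justified in
six columns with the text starting in the seventh, and the order of
the output for unsorted input (first seen, first listed) is an
assumption.

```python
from collections import Counter
from itertools import groupby


def a_occur(lines, min_freq=1, unsorted=False):
    """Collapse repeated lines to one line each, prefixed by a count."""
    if unsorted:
        counts = list(Counter(lines).items())   # like the /n flag
    else:
        # Sorted input: identical lines are adjacent, so count each run.
        counts = [(line, len(list(run))) for line, run in groupby(lines)]
    return [f"{count:<6}{line}" for line, count in counts
            if count >= min_freq]
```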
To finish this topic, we mention two simple utility
programs that are related to A_OCCUR.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
usage - a_occur2 [ min_frequency [ filename_under_min ] ]
< merged a_occur files > combined
A utility to calculate cumulative frequency of
merged A_OCCUR outputs. If a minimum frequency is
specified, then all lower frequency items are either
suppressed or sent to a file named in the next argument.
Input: ASCII text, in which each line starts with a number
(a frequency count) followed by blanks, then sorted text
starting in the seventh column.
Output: A copy of the same file in which multiple identical lines
are shown only once, preceded by the combined frequency
count.
Writeup: MIR TUTORIAL ONE, topic five.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage a_occur3 < occur_file > expanded_file
Reverse an A_OCCUR file by removing the initial count,
then outputting each line the number of times indicated
by the count. Useful if editing an A_OCCUR file, then
reconstituting it.
input: ASCII file with each line containing a count, blank
padded to the sixth character, then the line content.
output: Same content, but with leading six characters removed
and content repeated for "count" lines.
Writeup: MIR TUTORIAL ONE, topic five.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
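Both utilities are small transformations of the six-column count
format. The Python sketch below is an illustration under the same
assumptions as before (count in the first six columns, text from the
seventh); it is not the source of A_OCCUR2 or A_OCCUR3, and the
option to write low-frequency items to a separate file is left out.

```python
from itertools import groupby


def a_occur3(lines):
    """Expand counted lines back into repeated plain lines."""
    out = []
    for line in lines:
        out.extend([line[6:]] * int(line[:6]))
    return out


def a_occur2(lines, min_freq=1):
    """Add up the counts of adjacent lines that carry the same text."""
    combined = []
    for text, run in groupby(lines, key=lambda s: s[6:]):
        total = sum(int(s[:6]) for s in run)
        if total >= min_freq:
            combined.append(f"{total:<6}{text}")
    return combined
```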
* * * * *
We have identified several methods of viewing portions
of a computer file. Each is an aid in analyzing file content. The
most powerful aid is the A_PATTRN program. Its output may be
sorted so that the context of any character or sequence of up to 16
characters may be examined.
Interpreting the results becomes easier as you acquire
experience with various kinds of data. The next few topics offer
additional tools and pointers for analysis.